ABSTRACT

In many information networks, data items - such as updates 
in social networks, news flowing through interconnected RSS 
feeds and blogs, measurements in sensor networks, route 
updates in ad-hoc networks - propagate in an uncoordi- 
nated manner: nodes often relay information they receive 
to neighbors, independent of whether or not these neigh- 
bors received the same information from other sources. This 
uncoordinated data dissemination may result in significant, 
yet unnecessary communication and processing overheads, 
ultimately reducing the utility of information networks. To 
alleviate the negative impacts of this information multiplic- 
ity phenomenon, we propose that a subset of nodes (selected 
at key positions in the network) carry out additional information
filtering functionality. Thus, these nodes are responsible
for the removal (or significant reduction) of the redundant 
data items relayed through them. We refer to such nodes as 
filters. We formally define the Filter Placement prob- 
lem as a combinatorial optimization problem, and study its 
computational complexity for different types of graphs. We 
also present polynomial-time approximation algorithms and 
scalable heuristics for the problem. Our experimental re- 
sults, which we obtained through extensive simulations on 
synthetic and real-world information flow networks, suggest 
that in many settings a relatively small number of filters 
are fairly effective in removing a large fraction of redundant 
information. 

1. INTRODUCTION 

Information networks arise in many applications, including
social networks, RSS-feed and blog networks, sensor networks
and ad-hoc networks. In information networks, content
propagates from content creators, i.e., sources, to content
consumers through directed links connecting the various
nodes in the network. The utility of an information network
has long been associated with its ability to facilitate
effective information propagation. A network is considered
highly functional if all its nodes are up-to-date and aware
of newly-generated content of interest.

Motivation: A common characteristic of many informa- 
tion networks is that content propagation is not coordinated: 
nodes relay information they receive to their neighbors, in- 
dependent of whether these neighbors have received such 
information from other sources. This lack of coordination 
may be a result of node autonomy (as in social networks), 
limited capabilities and lack of local resources (as in sensor 
networks), or absence of topological/routing information (as 
in ad-hoc networks). As an example consider the users' feed 
in Facebook. The feed displays all content (e.g. videos) 
shared by the friends of a particular user U. In cases where 
two or more friends of U share the same content, this content 
appears multiple times in U's feed. Such uncoordinated
information propagation results in receiving the same or similar
content repeatedly; we call this phenomenon information
multiplicity. The redundancy underlying information multi- 
plicity results in significant, yet unnecessary, communication 
and processing overheads, ultimately reducing the utility of 
information networks. 

As an illustration of information multiplicity, consider the 
toy news network shown in Figure 1, in which branded syn- 
dicated content propagates over directed edges. In this ex- 
ample, node s is the originator of new information - the 
syndicated content - whereas nodes x and y are distributors 
(e.g., newspapers) that may add branding or advertisement 
to the syndicated content received from s. All nodes other 
than s do not generate new content; rather, they utilize their 
connections to relay content to other nodes along directed 
links. 



Figure 1: Illustration of information multiplicity. 

Now assume that a single news item i reaches x and y after
its generation by s. Nodes x and y forward a copy of i to
their neighbors, and as a result, z1, z2 and z3 receive i as
well. In fact, z2 (unnecessarily) receives two copies of i: one
from x and one from y. Even worse, if z1, z2 and z3 forward
whatever they get to w, then w receives (1 + 2 + 1) copies
of i. Clearly, to inform w, one copy of i is enough.

In many network settings we observe propagation of mul- 
tiple instances of the same underlying piece of information. 
For example, in online media networks information multiplicity
arises when different news sites post different articles
about the same topic. In sensor networks, information about
the same measurement is propagated by different sensors.

To alleviate the negative implications of information mul- 
tiplicity, we propose that a subset of nodes be strategically 
selected and equipped with additional "filtering" functional- 
ity, namely the removal (or significant reduction) of similar 
data items relayed through them. In some cases, filtering 
could be an inexpensive process when only exact matching 
techniques are used such as in broadcast by flooding scenar- 
ios [31]. In this case, all nodes can be equipped with this 
functionality. In other cases, filtering may cause significant 
overheads to identify similar but not identical content (e.g.,
image [25], video processing [5], time series analysis [3],
content with different branding or presentation). Due to this
overhead, the deployment of such a filtering functionality 
cannot be justified, except at a small number of nodes. 

We refer to nodes that carry out such special functionali- 
ties as filters. We refer to the problem of identifying the set 
of filter nodes in a given network as the Filter Placement 
problem. 

Notice that the placement of filters does not impact the 
utility of the network in any other way; filters simply remove 
similar content. In the example above, placing two filters at
z2 and w completely alleviates redundancy.

Paper Contributions: To the best of our knowledge, we 
are the first to address the issue of network information mul- 
tiplicity, and the first to propose a solution based on the 
deployment of filtering functionality at strategic locations 
in these information networks. We formally define the Fil- 
ter Placement problem as a combinatorial optimization 
problem, and study its computational complexity for differ- 
ent types of graphs. We present polynomial-time constant- 
factor approximation algorithms for solving the problem, 
which is NP-hard for arbitrary graph topologies. We also 
present a set of computational speedups that significantly 
improve the running time of our approximation algorithm 
without sacrificing its practical utility. Our experimental 
results, which we obtained through extensive simulations on 
synthetic and real-world information flow networks, suggest 
that in many settings a relatively small number of filters are 
fairly effective in removing a large fraction of duplicative 
information. 

2. RELATED WORK 

While our work is the first to motivate, formalize, and 
study the Filter Placement problem, the literature is 
full of other works in various application domains that ad- 
dress two related problems: the use of coordinated content 
dissemination, and/or the selection of nodes to carry out 
special functionalities. In this section, we provide a review 
of these works. 

Centrality in Networks: Shortest-path distances between
nodes in a network play an important role in many applications,
including transportation, protein interaction and social
networks. The importance of a node in such a network
is defined by its betweenness centrality. This measure com- 
putes for every node the total number of shortest paths that 
node lies on. This measure can be generalized to a subset 
of the nodes in the graph: group betweenness computes the
number of shortest paths that a set of nodes covers. Brandes
[2] gives an effective algorithm to compute betweenness
for nodes and sets of nodes in a graph. Potamias et al. [28] 
show how nodes with high centrality can be used to give 
an effective approximation of the distance between all node 
pairs in a network. The main point in these works is to 
find nodes which lie on as many shortest paths as possi- 
ble. The Filter Placement problem is not related to 
betweenness centrality. Content does not only propagate 
along shortest paths. The goal of Filter Placement is 
rather to cover as many paths as possible. In the example 
given in Figure 1, nodes with the highest betweenness cen- 
trality are x and y. However, the only node where we can 
apply meaningful filtering functionality in this graph, is z2. 
This example demonstrates that finding a set of nodes with 
high betweenness centrality does not necessarily solve the 
Filter Placement problem. 

Social Networks: The problem of identifying k key nodes 
or agents has been addressed in the context of many different 
social-network studies and applications. 

For example, a number of studies focused on the identi- 
fication of k influential nodes such that information seeded 
(e.g., advertisements) at these nodes would be maximally 
spread out throughout the network [7, 11, 13, 30]. All ex- 
isting variants of this influence-maximization problem are 
concerned with improving the extent of information spread 
in the network, even if such maximization results in signif- 
icant information redundancy. Our work is complementary 
in that it does not aim to change or improve information 
spread. Rather, our work aims to identify the placement of 
k filter nodes so as to minimize redundant communication 
and processing of information without changing the original 
extent of information spread. 

Another line of work focuses on the identification of k 
nodes that need to be monitored (and/or immunized) in or- 
der to detect contamination (prevent epidemics) [1, 14, 18, 
27]. Here, the goal of selecting these key nodes is to inhibit
as much as possible the spread of harmful information
content (e.g., viruses) by ensuring that the selected nodes
act as barriers that stop/terminate the flow of such content. 
In our model, filters do not terminate or inhibit the prop- 
agation of information. Rather, they remove redundancy 
in order to propagate a streamlined/sanitized ("cleaner") 
version of the information content. As a result, the under- 
lying combinatorial problem we encounter when solving the 
Filter Placement problem is different from the combi- 
natorial problems addressed before. 

Sensor Networks: At a high level, our work is related to 
mechanisms for information flow management in sensor net- 
works. In sensor networks, nodes are impoverished devices 
- with limited battery life, storage capacity, and process- 
ing capabilities - necessitating the use of in-network data 
aggregation to preserve resources, and/or the use of coor- 
dination of node operations for effective routing and query 
processing. In-network aggregation is not aimed at removal 
of redundant data, but at the extraction of aggregate statis- 
tics from data (e.g., sum, average, etc). Therefore, studies 
along these lines focus on the design of data-aggregation 
strategies [6, 10, 12, 19] as well as associated routing of 
queries and query results [15] in order to allow for reduced 
communication costs. Here we note that there is an implicit 
tradeoff between resource consumption and the quality of 
information flow. An aggregate is effectively a caricature 
(an approximation) of the information; aggressive aggregation
implies a reduction in the quality (and hence utility) of
information flow. 

Coordinated communication of sensory data is exemplified 
by recent work that went beyond standard aggregation and 
focused on minimizing the communication cost required for 
the evaluation of multi-predicate queries scheduled across 
network nodes [4]. In that work, a dynamic-programming 
algorithm is used to determine an optimal execution order 
of queries. Examples of other studies that proposed coor- 
dination strategies include techniques that ensure efficient 
storage, caching, and exchange of spatio-temporal data [22, 
23] or aggregates thereof [21]. 

In addition to their focus on balancing information fidelity 
and resource consumption, there is an implicit assumption 
in all of these studies that all nodes in the network are un- 
der the same administrative domain, and hence could be 
expected to coordinate their actions (e.g., as it relates to
routing or wake-sleep cycles). In our work, we consider set- 
tings in which coordination of node operation is not justified 
(since nodes are autonomous). Instead, we focus on reduc- 
ing overheads by adding functionality to a subset of nodes, 
without concern to resource or energy constraints. 

Finally, broadcast by flooding in ad-hoc networks is an 
active topic of research. A good comparison of suggested 
methods is done by Williams and Camp [31]. In its basic 
model, a node propagates the broadcast message it receives 
to all of its neighbors. The goal is to devise propagation 
heuristics, such that all nodes in the network receive the 
content, but at the same time the number of exact messages 
propagated in the network is minimized to avoid network 
congestion. In this setup, the propagated items are exactly 
the same, thus comparison of items, by storing a fingerprint 
of the received content, is a relatively cheap operation for 
nodes. The emphasis of existing research in this area is more 
on effective ways of information spreading. Our work is 
applicable in a different domain of problems, namely where 
spread of information is given, but comparison of content 
is expensive and not every node can be equipped with this 
functionality. 

Content Networks: A common challenge in large-scale 
access and distribution networks is the optimal placement 
of servers to optimize some objective (e.g., minimize average
distance or delay between end-points and the closest content 
server) - namely, the classical facility location and k-median 
problems [20], for which a large number of centralized [20, 8] 
and distributed [24, 9] solutions have been developed. Con- 
ceptually, the Filter Placement problem could be seen 
as a facility location problem, wherein filters constitute the 
facilities to be acquired. However, in terms of its objective, 
the Filter Placement problem is fundamentally different 
since there is no notion of local measures of "distance" or 
"cost" between installed facilities and end-points. Rather, in 
our setting, the subject of the optimization is a global mea- 
sure of the impact of all facilities (and not just the closest) 
on the utility that end-points derive from the network. 

3. THE FILTER PLACEMENT PROBLEM 

Propagation model: In this paper, we consider networks 
consisting of an interconnected set of entities (e.g., users,
software agents) who relay information items (e.g., links,
ideas, articles, news) to one another. We represent such a
network as a directed graph G(V, E), which we call the
communication graph (c-graph). Participants in the network
correspond to the nodeset V. A directed edge $(u, v) \in E$ in
the graph represents a link, along which node v can propagate
items to u. Some nodes of G generate new items by
virtue of access to some information origin; we call these 
nodes sources. Sources generate distinct items - i.e., any 
two items generated by the same source are distinct. Once 
an item is generated, it is propagated through G as follows: 
every node that receives an item blindly propagates copies of 
it to its outgoing neighbors. Since the propagation is blind, 
a node might receive many copies of a single item, leading 
to information multiplicity. 

Our information propagation model is deterministic in the 
sense that every item reaching a node is relayed to all neigh- 
bors of that node. In reality, links are associated with prob- 
abilities that capture the tendency of a node to propagate 
messages to its neighbors. Although our results (both
theoretical and experimental) continue to hold under a
probabilistic information propagation model, for ease of presentation
and without loss of generality we adopt a deterministic
propagation model in this paper. Moreover, even though the 
propagation model defined by G could be used to communi- 
cate multiple items, in this paper we focus on a single item 
i. The technical results are identical for the multiple-item 
version of the problem. 

Filters: Consider the production of a new item i by source 
s. In order for node v to receive this item, i has to travel 
along a directed path from s to v. Since there are several 
paths from s to v, node v will receive multiple copies of i. 
Moreover if v has children, then v propagates every copy 
of i it receives to each one of its children. To reduce the 
amount of redundancy (underlying information multiplic- 
ity), our goal is to add a filtering functionality to some of 
the nodes in G. We use filters to refer to nodes augmented 
with such filtering functionality. 

A filter can be viewed as a function that takes as input a
multiset of items I and outputs a set of items I', such that
the cardinality of I' is less than the cardinality of I. The
specific filtering function depends on the application. We 
emphasize, that for many applications such filtering may be 
costly to implement (for example resource-intensive). For 
ease of exposition, we will fix the filter function to be the 
function that eliminates all duplicate content:¹ for every
item the filter node receives, it will perform a check to de- 
termine if it has relayed an item with similar content before. 
If not, it propagates the item to all of its neighbors. A filter 
node never propagates already propagated content. 

Objective Function: Let $v \in V$ be an arbitrary node
in the graph. Define $\Phi(\emptyset, v)$ as the number of (not
necessarily distinct) items that node v receives when no filters
are placed. Let $A \subseteq V$ be a subset of V. Then $\Phi(A, v)$
denotes the number of items node v receives when filters
are placed in the nodes in A. For a subset $X \subseteq V$, let

$$\Phi(A, X) = \sum_{x \in X} \Phi(A, x).$$

For a given item of information, the total number of messages
that nodes in V receive in the absence of filters is
$\Phi(\emptyset, V)$. For filter set A, the total number of items that
nodes in V receive is $\Phi(A, V)$. Thus, our goal is to find the
set A of k nodes where filters should be placed, such that
the difference between $\Phi(\emptyset, V)$ and $\Phi(A, V)$ is maximized.

¹ Generalizations that allow for a percentage of duplicates to
make it through a filter are straightforward. 
Problem 1 (Filter Placement-FP). Given directed
c-graph G(V, E) and an integer k, find a subset of nodes
$A \subseteq V$ of size $|A| \le k$, which maximizes the function

$$F(A) = \Phi(\emptyset, V) - \Phi(A, V).$$
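The objective can be made concrete with a small simulation. The following is a minimal sketch, assuming a single source, a DAG given as (parent, child) pairs with items flowing from parent to child, and the duplicate-eliminating filter described above. The function names `phi` and `F` and the edge list for the toy network of Figure 1 are illustrative assumptions reconstructed from the prose (in particular, that z1 is fed by x and z3 by y).

```python
from collections import defaultdict, deque

def topological_order(nodes, edges):
    """Kahn's algorithm; edges are (parent, child) pairs."""
    children = defaultdict(list)
    indegree = {v: 0 for v in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    return order, children

def phi(nodes, edges, source, filters=frozenset()):
    """Total number of copies received by all nodes, i.e. Phi(A, V)."""
    order, children = topological_order(nodes, edges)
    received = {v: 0 for v in nodes}
    for v in order:
        if v == source:
            emitted = 1                      # the source creates one item
        elif v in filters:
            emitted = min(received[v], 1)    # a filter relays at most one copy
        else:
            emitted = received[v]            # blind relay of every copy
        for c in children[v]:
            received[c] += emitted
    return sum(received.values())

def F(nodes, edges, source, filters):
    """Objective of Problem 1: copies saved by the filter set."""
    return phi(nodes, edges, source) - phi(nodes, edges, source, filters)

# Toy network of Figure 1 (edge list assumed from the description in the text).
NODES = ["s", "x", "y", "z1", "z2", "z3", "w"]
EDGES = [("s", "x"), ("s", "y"), ("x", "z1"), ("x", "z2"),
         ("y", "z2"), ("y", "z3"), ("z1", "w"), ("z2", "w"), ("z3", "w")]

print(phi(NODES, EDGES, "s"))          # copies received without filters
print(F(NODES, EDGES, "s", {"z2"}))    # copies saved by a filter at z2
```

Under this counting, a filter at a sink such as w saves nothing, because the copies it receives are still counted and it has no children to relay to.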

Another choice for an objective function in Problem 1 
would be to minimize $\Phi(A, V)$, and by that maximize the
number of items covered. However, this function has shortcomings
which make it undesirable: as we observed, every
copy of an item corresponds to a directed path in the graph.
The total number of directed paths in a graph is typically
exponential in the number of nodes. Hence, even covering
an exponential number of paths may result in a relatively
low level of redundancy elimination. Also, as shown in the 
example in Figure 1, not all redundancy can be eliminated. 
As a consequence, the number of paths covered is not a good 
indicator of the quality of filtering. 

The objective function F(A) described in Problem 1 overcomes
these shortcomings by measuring the improvement in
redundancy reduction. In addition, it has some nice properties.
It is always non-negative and monotone, since placing an
additional filter can only reduce the number of items. This
also implies that F is bounded: $F(\emptyset) = 0 \le F(A) \le F(V)$.
Furthermore, function F is submodular, since for every
$X \subseteq Y \subseteq V$ and $v \notin Y$, it holds that
$F(X \cup \{v\}) - F(X) \ge F(Y \cup \{v\}) - F(Y)$.

In the definition of Problem 1, there is a bound k on the 
size of the filter set A. In the next proposition we show 
that when the number of filters is not bounded, finding the 
minimal size filter placement that maximizes FP is trivial. 
Throughout the paper we use $n = |V|$ to denote the number
of nodes in G. 

Proposition 1. Let G(V, E) be a directed c-graph. Finding
the minimal sized set of filters $A \subseteq V$ such that $F(A) = F(V)$
takes $O(|E|)$ time.

Proof. Let the filter set A consist of the nodes $v \in V$
that are not sinks and have $d_{in}(v) > 1$, i.e.,
$A = \{v \in V \mid d_{in}(v) > 1 \text{ and } d_{out}(v) > 0\}$.
With this choice of A, every node either is a sink, is in A,
or has indegree 1. Hence, every node
propagates at most one copy of an item to its children. This
shows the optimality of A. It is easy to see that A is a
minimal optimal set; omitting any node from set A would
result in unnecessary content duplication. Finding set A
needs one traversal of the graph, to determine the nodes with
indegree greater than 1. This has running time proportional
to the number of edges ($O(|E|)$) in the graph. □
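The construction in the proof amounts to a single pass over the edge list. A minimal sketch follows, assuming edges are given as (parent, child) pairs; the function name `unbounded_optimal_filters` and the toy edge list (the network of Figure 1 as described in the text) are illustrative assumptions.

```python
from collections import Counter

def unbounded_optimal_filters(edges):
    """Nodes with indegree > 1 that are not sinks (the set A of Proposition 1)."""
    indegree = Counter(child for _, child in edges)
    outdegree = Counter(parent for parent, _ in edges)
    # Counter returns 0 for nodes that never appear as a parent (sinks).
    return {v for v in indegree if indegree[v] > 1 and outdegree[v] > 0}

# Toy network of Figure 1 (edge list assumed from the description in the text).
EDGES = [("s", "x"), ("s", "y"), ("x", "z1"), ("x", "z2"),
         ("y", "z2"), ("y", "z3"), ("z1", "w"), ("z2", "w"), ("z3", "w")]

print(unbounded_optimal_filters(EDGES))
```

On this toy network only z2 qualifies: w has indegree 3 but is a sink, so a filter there would not change how many copies any node receives.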

Despite this result, FP for a filter set of fixed size k on
an arbitrary graph is NP-complete. (For the proof see the
appendix.)

Theorem 1. The FP problem on an arbitrary c-graph 
G(V, E) is NP-complete. 

4. FILTER PLACEMENT ALGORITHMS 

In this section, we present algorithms for the FP prob- 
lem on different types of c-graphs, namely trees, DAGs and 
arbitrary directed graphs. 

4.1 Filter Placement in a Tree 

While FP is hard on arbitrary graphs, and as we will show
also on DAGs, it can be solved in polynomial time with
dynamic programming on c-trees. We call a graph G(V, E)
a communication tree (c-tree), if in addition to being a c-graph,
G is a tree when removing the source node. The
recursion of the dynamic programming algorithm is done on
the children of every node. Transforming the input c-tree G
into a binary tree makes it computationally more feasible.
This transformation can be done in the following way: for
every node $v \in V$, if $d_{out}(v) \le 2$ then do nothing. Otherwise,
fix an arbitrary ordering $v_1, v_2, \ldots, v_r$ of the children of v,
where $d_{out}(v) = r$. Let $v_1$ be the left child of v. Create a
new node $u_1$ and let that be the right child of v. Let the
remaining children of v be the children of $u_1$. Repeat these
steps until $u_{r-1}$ has only two children: $v_{r-1}$ and $v_r$. The
edges adjacent to the source will continue to be connected
to the nodes in V. The resulting binary tree is G'. Observe
that the number of nodes in G' is at most twice the
number of nodes in G. Also notice that if the maximum
out-degree in tree G is $\Delta$ then the height of G' is at most a
factor of $\Delta - 1$ larger than G.
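The binarization step can be sketched as follows. This is an illustrative sketch only: it assumes the tree (with the source already removed) is given as a dictionary mapping each node to an ordered list of children, and the function name `binarize` and the `("u", n)` naming of the newly created nodes are hypothetical. Edges from the source are not modeled here.

```python
def binarize(children, root):
    """Return the children map of the binary tree G' and the set of new nodes."""
    binary = {}
    new_nodes = set()
    counter = 0
    stack = [root]
    while stack:
        v = stack.pop()
        kids = list(children.get(v, []))
        parent = v
        # Peel off one child at a time; the rest hang below a fresh node.
        while len(kids) > 2:
            counter += 1
            fresh = ("u", counter)
            new_nodes.add(fresh)
            binary[parent] = [kids[0], fresh]
            parent, kids = fresh, kids[1:]
        binary[parent] = kids
        stack.extend(children.get(v, []))
    return binary, new_nodes

# Hypothetical tree: root r with four children, one of which has a child.
TREE = {"r": ["a", "b", "c", "d"], "a": ["e"]}
BINARY, NEW_NODES = binarize(TREE, "r")
print(BINARY["r"], len(NEW_NODES))
```

A node with r > 2 children gives rise to r - 2 new nodes in this sketch, so the total number of nodes stays below twice the original count.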

We apply dynamic programming on G'. Let OPT(v, i, A)
be the function that finds an optimal filter set A of size
$|A| \le i \le k$ in the subtree rooted in v. Then for every
$i = 0, \ldots, k$ we can compute OPT(v, i, A) by

$$\mathrm{OPT}(v, i, A) = \max\Big\{ \max_{j = 0, \ldots, i} \big\{ \mathrm{OPT}(v_l, j, A) + \mathrm{OPT}(v_r, i - j, A) \big\},\ \max_{j = 0, \ldots, i-1} \big\{ \mathrm{OPT}(v_l, j, A \cup \{v\}) + \mathrm{OPT}(v_r, i - 1 - j, A \cup \{v\}) \big\} \Big\}.$$

In the equation above, $v_l$ and $v_r$ denote the left and right
child of v. The first term of this recursion corresponds to the
case where we do not place a filter in v, hence a total number
of i filters can be placed in the subtrees. The second term
corresponds to the case when we do place a filter in v and
only $i - 1$ can be placed in the subtrees. The optimal solution
for the whole tree is then OPT(r, k, A) where $r \in V$ is the
root of the tree. The above recursion does not guarantee
that we do not choose any dummy nodes of the binary tree.
For this reason, we omit the second term of the recursion
when v is a dummy node. Building the binary tree takes
$O(n\Delta)$ time. We have to evaluate the recursion $O(k)$ times
for every node. One evaluation takes $O(k)$ computations.
There are $O(n \log \Delta)$ nodes in G', which makes the total
running time $O(n\Delta + k^2 n \log \Delta)$.

4.2 Filter Placement in DAGs 

Consider c-graphs G(V, E), which are directed and acyclic 
(DAGs). Although DAGs seem to have simpler structures 
than arbitrary graphs, the FP problem is NP-complete even 
on DAGs. The proof can be found in the appendix. 

Theorem 2. The FP problem is NP-complete when the 
c-graph G(V,E) is a DAG. 

In the remainder of this section, we propose polynomial-time
algorithms that result in solutions for the FP problem
that we show to be effective in our experiments.

First, we start with a naive approach, Greedy_1 (G_1) (this
name will be meaningful later, when we propose heuristics
that can be viewed as extensions of Greedy_1). Consider
node $v \in V$; v will receive items on its incoming
edges and will forward a copy of every item on every one
of its outgoing edges. We can now compute a lower bound
on the number of copies of an item that v is propagating:
$m(v) = d_{in}(v) \times d_{out}(v)$. Greedy_1 computes m(v) for every
$v \in V$ and chooses the k nodes with the highest m() values.
Computing m() depends on the way the graph is stored. In
general it takes $O(|E|)$ time, since we need to compute the
in- and outdegree of every node. Finding the k largest values
of m() requires O(kn) computations, which makes the total
running time of Greedy_1 $O(kn + |E|)$.
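A compact sketch of this heuristic is given below. It assumes edges are (parent, child) pairs; the function name `greedy_1` and the toy edge list (the network of Figure 1 as described in the text) are illustrative assumptions, and ties between equal m() values are broken by input order, which the description above leaves unspecified.

```python
import heapq
from collections import Counter

def greedy_1(nodes, edges, k):
    """Pick the k nodes with the largest m(v) = indegree(v) * outdegree(v)."""
    indegree = Counter(child for _, child in edges)
    outdegree = Counter(parent for parent, _ in edges)
    m = {v: indegree[v] * outdegree[v] for v in nodes}
    chosen = heapq.nlargest(k, nodes, key=lambda v: m[v])
    return chosen, m

# Toy network of Figure 1 (edge list assumed from the description in the text).
NODES = ["s", "x", "y", "z1", "z2", "z3", "w"]
EDGES = [("s", "x"), ("s", "y"), ("x", "z1"), ("x", "z2"),
         ("y", "z2"), ("y", "z3"), ("z1", "w"), ("z2", "w"), ("z3", "w")]

CHOSEN, M = greedy_1(NODES, EDGES, 1)
print(CHOSEN, M)
```

On this toy network x, y and z2 all have m() = 2, so the degree product alone cannot tell the one useful position (z2) from the two where a filter removes nothing.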




Figure 2: For k = 1 Greedy_1 places a filter in B while
the optimal solution would be to place a filter in A.

Although Greedy_1 is simple and efficient, for many graphs
it does not yield an optimal solution, as exemplified in Figure 2.
Without any filters, the total number of received
items in the graph is 14. Greedy_1 would place a filter in
node B; this is because $m(B) = 1 \times 4$ is the largest m()
value in this graph. However, the optimal solution would be
to place a filter in A, for which $m(A) = 3 \times 1$. Placing a
filter in B leaves the number of received items unchanged.
Making node A a filter instead would yield a total number
of 12 received items.

In light of this illustrative example, we propose Greedy_All
(G_All), which is an improved extension of Greedy_1. The
main idea behind Greedy_All is to compute for every node
$v \in V$ how many copies of a propagating item i are generated
because of v. The algorithm then greedily chooses the
node with the highest such number to put a filter.

First, we look at the propagation of an item i that is
created by the source s. In order for i to be received by
node v, i has to propagate along at least one path from
s to v. In fact, v will receive as many copies of i as the
number of paths from s to v. We denote the number of distinct
directed paths from any node x to y by #paths(x, y).
Then, the number of copies of i that arrive in v can be expressed
as #paths(s, v). For later clarity we need to make
the distinction between the number of paths between two
nodes and the number of copies of i that v receives. We
denote the latter value by Prefix(v). Observe that for
now #paths(s, v) = Prefix(v). We denote by Suffix(v)
the number of copies of i that are generated across the
whole graph after i has propagated through v. Similarly to
Prefix(v), the Suffix of v is also related to directed paths
in the graph. In this simple case the Suffix is equal to
the total number of distinct directed paths that start in v:
$\mathrm{Suffix}(v) = \sum_{x \in V} \#\mathrm{paths}(v, x)$. Observe now that the
total number of copies of i that propagate through v is the
product $\mathrm{Prefix}(v) \times \mathrm{Suffix}(v)$.

To study the effects of placing a filter in v, observe that
even if node v were a filter, it would still propagate one
copy of i. In other words, placing a filter in v has the same
effect on the number of copies of i as if Prefix(v) = 1.
Hence, the amount of redundant copies of i generated because
of the propagation through v can be expressed by
$I(v) = (\mathrm{Prefix}(v) - 1) \times \mathrm{Suffix}(v)$. We call I(v) the impact
of v. The impact can also be expressed in terms of the
objective function: $F(\{v\}) = \Phi(\emptyset, V) - \Phi(\{v\}, V) = I(v)$.
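The quantities Prefix, Suffix and impact can be computed with two passes over a topological order when no filters are present. The sketch below is illustrative and rests on two assumptions that go beyond the text: the source is assigned Prefix 1, and Suffix counts only paths with at least one edge, so that I(v) equals the number of received copies saved by a filter at v. The function name `impacts` and the toy edge list (the network of Figure 1 as described in the text) are also assumptions.

```python
from collections import defaultdict, deque

def impacts(nodes, edges, source):
    """Return (prefix, suffix, impact) dictionaries for a filter-free DAG."""
    children, parents = defaultdict(list), defaultdict(list)
    indegree = {v: 0 for v in nodes}
    for p, c in edges:
        children[p].append(c)
        parents[c].append(p)
        indegree[c] += 1
    # Topological order via Kahn's algorithm.
    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    # Prefix(v): number of paths from the source to v.
    prefix = {}
    for v in order:
        prefix[v] = 1 if v == source else sum(prefix[p] for p in parents[v])
    # Suffix(v): number of paths (with at least one edge) starting in v.
    suffix = {}
    for v in reversed(order):
        suffix[v] = sum(1 + suffix[c] for c in children[v])
    impact = {v: max(prefix[v] - 1, 0) * suffix[v]
              for v in nodes if v != source}
    return prefix, suffix, impact

# Toy network of Figure 1 (edge list assumed from the description in the text).
NODES = ["s", "x", "y", "z1", "z2", "z3", "w"]
EDGES = [("s", "x"), ("s", "y"), ("x", "z1"), ("x", "z2"),
         ("y", "z2"), ("y", "z3"), ("z1", "w"), ("z2", "w"), ("z3", "w")]

PREFIX, SUFFIX, IMPACT = impacts(NODES, EDGES, "s")
print(PREFIX["w"], SUFFIX["s"], IMPACT)
```

With these conventions the Suffix of the source equals the total number of copies received in the network, and z2 is the only node of the toy network with a non-zero impact.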



This is the concept underlying our Greedy_All algorithm:
The algorithm first chooses the node with the highest impact.
Placing a filter at that node might change the impact
of other nodes. Hence, an update of the impact of every node
is required. Then Greedy_All chooses again the node with
the highest impact. The algorithm repeats this for k steps.
Observe that Greedy_1 only looks at the immediate neighbors
of every node, whereas Greedy_All is more exhaustive
and considers all the nodes in the graph.

Algorithm 1 Greedy_All algorithm

Input: DAG G(V, E) and integer k.

Output: set of filters $A \subseteq V$ with $|A| \le k$.
find topological order $\sigma$ of nodes
$A \leftarrow \emptyset$
for i = 1 ... k do
    for j = 1 ... n do
        compute $I(v_j)$
    $A \leftarrow A \cup \{\arg\max_{v \in V} I(v)\}$
return A
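The greedy loop of Algorithm 1 can be sketched in a few lines. Note the substitution: instead of the Prefix/Suffix bookkeeping, this sketch obtains the current impact of each candidate by re-simulating the propagation with and without it, which is simpler but costs one graph pass per candidate. The function names `total_received` and `greedy_all` and the double-diamond example graph are hypothetical.

```python
from collections import defaultdict, deque

def total_received(nodes, edges, source, filters):
    """Phi(A, V): copies received by all nodes when `filters` deduplicate."""
    children = defaultdict(list)
    indegree = {v: 0 for v in nodes}
    for p, c in edges:
        children[p].append(c)
        indegree[c] += 1
    queue = deque(v for v in nodes if indegree[v] == 0)
    received = {v: 0 for v in nodes}
    while queue:                       # process nodes in topological order
        v = queue.popleft()
        if v == source:
            emitted = 1
        elif v in filters:
            emitted = min(received[v], 1)
        else:
            emitted = received[v]
        for c in children[v]:
            received[c] += emitted
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    return sum(received.values())

def greedy_all(nodes, edges, source, k):
    """Repeatedly add the node whose filter saves the most copies."""
    chosen = set()
    for _ in range(k):
        base = total_received(nodes, edges, source, chosen)
        best, best_gain = None, 0
        for v in nodes:
            if v == source or v in chosen:
                continue
            gain = base - total_received(nodes, edges, source, chosen | {v})
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:               # no remaining node reduces redundancy
            break
        chosen.add(best)
    return chosen

# Hypothetical example: two diamonds in sequence, followed by a tail node g.
NODES = ["s", "a", "b", "c", "d", "e", "f", "g"]
EDGES = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("c", "e"), ("d", "f"), ("e", "f"), ("f", "g")]

print(greedy_all(NODES, EDGES, "s", 2))
```

In this example the first pick is c (saving 6 copies) and the second is f, whose gain drops from 3 to 1 once c is a filter, which illustrates why impacts have to be updated after every placement.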



Because of the properties of F(), we can use the well-known
result by Nemhauser et al. [26], which states that for
any maximization problem where the objective function is a
non-negative, monotone and submodular set-function, the greedy
approach yields a $(1 - 1/e)$-approximation.

Theorem 3. The Greedy_All algorithm for Problem 1 is
a $(1 - 1/e)$-approximation.

Implementation of Greedy_All. In order to compute the
impact, we need to compute the Prefix and Suffix of
every node. We can think of Prefix(v) as the number of
copies of an item i that v receives. Since the copies of i are
propagated through the parents of v, it is easy to see that the
Prefix of a node is the sum of the prefixes of its parents. We
can compute the Prefix of every node recursively, by first
computing the Prefix of its ancestors. We fix a topological
order $\sigma$ of the nodes. (A topological order of nodes is such
an order, in which every edge is directed from a smaller to
a larger ranked node in the ordering.) This order naturally
implies that the parents of a node precede it in the ordering.
Traversing the nodes of G in the order of $\sigma$, the Prefix can
be computed with the recursive formula (1) ($\Pi_v$ denotes the
set of parents of v).

$$\mathrm{Prefix}(v) = \sum_{x \in \Pi_v} \mathrm{Prefix}(x) \qquad (1)$$

Remember that the Prefix of a node can also be expressed
as #paths(s, v), thus formula (1) is equivalent to

$$\mathrm{Prefix}(v) = \#\mathrm{paths}(s, v) = \sum_{x \in \Pi_v} \#\mathrm{paths}(s, x) \qquad (2)$$

As we established before, Suffix(v) is equivalent to the
total number of directed paths starting in v. This can
be computed effectively by doing some bookkeeping during
the recursive computation of (1): For every node v we
will maintain a list, $\mathrm{plist}_v$, that contains for every ancestor
x of v the number of paths that go from x to v. Thus
$\mathrm{plist}_v[x] = \#\mathrm{paths}(x, v)$. Observe now that for an ancestor
x of v, $\mathrm{plist}_v[x]$ can be computed as the sum of the
plist of the parents.

$$\forall x \in V: \quad \mathrm{plist}_v[x] = \sum_{p \in \Pi_v} \mathrm{plist}_p[x] \qquad (3)$$

Observe that for every node, $\mathrm{plist}_v$ can be computed during
the same recursion as (1).

To compute the Suffix of a node v, we need to sum the
number of paths that start in v. This is simply the sum of
the plist entries that correspond to v.

$$\mathrm{Suffix}(v) = \#\mathrm{paths}(v, \cdot) = \sum_{x \in V} \mathrm{plist}_x[v] \qquad (4)$$

As a technical detail, in order to use this recursive formula,
every node's plist contains itself with value one:
$\mathrm{plist}_v[v] = 1$. As a special case, a source's list would contain
only the entry corresponding to itself.

Thus far, we described how to compute the impact of a 
node by computing its Prefix and Suffix when there are 
no filters in the network. Remember now our earlier observation
that placing a filter in a node v* has the same effect
on the number of copies of an item as if there was only
one path leading from the source to v*. We can capture
this effect by setting $\mathrm{plist}_{v^*}[v^*] = 1$ and all other values
$\mathrm{plist}_{v^*}[x] = 0$, before using this list in the recursion. Observe
that the change in $\mathrm{plist}_{v^*}$ changes the Suffix of all
nodes preceding v*, and the Prefix of all nodes succeeding
it. For this reason, we need to make a pass over the whole
graph when updating the impact.

Running time of Greedy_All. The topological order $\sigma$ of
the nodes can be computed in linear time, which is
negligible in the total running time of the algorithm.
Formulas (1) and (3) are updated along every edge of the
graph. Formula (1) can be updated in constant time along
an edge, while formula (3) requires $O(\Delta)$ lookups and
additions, where $\Delta$ denotes the maximal degree in
the graph. To compute Suffix(v) we keep a counter for
every node v and, according to formula (4), update it
online as the plist entries are computed. This yields a
total running time of $O(|E| \cdot \Delta)$ for one iteration of the
algorithm. Greedy_All has k iterations, which yields a total
running time of $O(k \cdot \Delta \cdot |E|)$. This can be $O(k \cdot n^3)$ in the
worst case, but in practice, for most c-graphs $|E| = O(n \cdot \log n)$,
which results in an $O(k \cdot n \cdot \log n)$ running time.
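The greedy selection loop can be illustrated with a deliberately naive sketch. It does not use the Prefix/Suffix bookkeeping analyzed above; instead it recomputes the total number of received items from scratch for every candidate, which is slower but easy to check. The names `total_received` and `greedy_all` are hypothetical, and the propagation model is an assumption based on the description in this section: the source emits one item, a filter forwards a single copy, and every other node forwards each copy it receives.

```python
from graphlib import TopologicalSorter


def total_received(parents, source, filters=frozenset()):
    """Total number of item copies received by all nodes for a filter set."""
    received, relayed = {}, {}
    for v in TopologicalSorter(parents).static_order():
        received[v] = sum(relayed[p] for p in parents[v])
        if v == source:
            relayed[v] = 1                    # the source emits one item
        elif v in filters:
            relayed[v] = min(received[v], 1)  # a filter forwards one copy
        else:
            relayed[v] = received[v]          # others forward every copy
    return sum(received.values())


def greedy_all(parents, source, k):
    """Pick up to k filters, each time taking the largest marginal reduction."""
    filters = set()
    for _ in range(k):
        current = total_received(parents, source, filters)
        candidates = [v for v in parents if v != source and v not in filters]
        if not candidates:
            break
        best = max(
            candidates,
            key=lambda v: current - total_received(parents, source, filters | {v}),
        )
        filters.add(best)
    return filters
```

Each iteration here costs one full pass over the graph per candidate, so this variant is only meant for small examples where the result can be verified by hand.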

Observe that Greedy_All is optimal for k = 1. For larger
values of k it also yields very good results. In our
experiments we found that there are real-life graphs where
Greedy_All is capable of finding an FP which yields perfect
filtering. However, in some cases it does not find the
optimal solution. Consider the toy example in Figure 3. When
no filters are placed, the total number of received items is
$\Phi(\emptyset, V) = 26$. Since for k = 1 the impact values of the nodes
are I(A) = 7, I(B) = 6, I(C) = 6, Greedy_All will choose
A as its first filter. Observe now that the total number
of items has been reduced by the amount of A's impact:
$\Phi(\{A\}, V) = 19$. Then, for k = 2, nodes B and C have
updated impacts I(B|A) = 3 and I(C|A) = 4, so the algorithm
will choose C. This yields a total of $\Phi(\{A, C\}, V) = 15$
received items in this system. The optimal solution would
be to place filters in nodes B and C, which would result in
$\Phi(\{B, C\}, V) = 14$ items received.

Computational speedups. Our experimental evaluation
(Section 5) shows that although Greedy_All yields good
results with respect to our objective function, it is rather
inefficient to apply to large datasets. For this reason, we
propose two new heuristics that, along with Greedy_1, yield
much faster and yet effective solutions to FP. Both heuristics
are inspired by the principles behind Greedy_All.

Figure 3: For k = 2 Greedy_All chooses filter set
{A, C}, while the optimal solution is {B, C}.

The first algorithm, Greedy_Max (G_Max), computes the impact
of all nodes in the graph, similar to Greedy_All. Once
the impacts are calculated, Greedy_Max selects the k nodes
with the highest impact as filters, without recomputing
the impacts. As our experiments show, Greedy_Max finds
solutions very similar to those found by Greedy_All. The
running time of this algorithm is $O(n|E|)$, since the impact
of nodes needs to be computed only once.

Our other heuristic is Greedy_L (G_L). This algorithm computes
a simplified impact for every node: $I'(v) = \mathrm{Prefix}(v) \times d_{out}(v)$.
This is the number of items v propagates to its immediate
children. Then Greedy_L picks the top k nodes with
respect to $I'$ as filters. To compute $I'()$, we need to compute
Prefix() (Equation (1)) and we need to know the out-degree of
every node. Both tasks can be accomplished by traversing
the edges once in Greedy_L. $I'$ is updated in every iteration,
which yields a total running time of $O(k|E|)$.
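A minimal sketch of the Greedy_L loop, under stated assumptions, is given below. The function name `greedy_l` and the `parents` dictionary layout are hypothetical. The sketch assumes that a node already chosen as a filter relays a single copy when Prefix is recomputed, and that the source itself is never a candidate; neither detail is spelled out in this section.

```python
from graphlib import TopologicalSorter


def greedy_l(parents, source, k):
    """Pick up to k filters by the simplified impact Prefix(v) * d_out(v)."""
    out_degree = {v: 0 for v in parents}
    for v in parents:
        for p in parents[v]:
            out_degree[p] += 1
    order = list(TopologicalSorter(parents).static_order())
    filters = []
    for _ in range(k):
        # One pass over the edges: recompute Prefix given the current filters.
        prefix, relayed = {}, {}
        for v in order:
            prefix[v] = 1 if v == source else sum(relayed[p] for p in parents[v])
            relayed[v] = min(prefix[v], 1) if v in filters else prefix[v]
        candidates = [v for v in order if v != source and v not in filters]
        if not candidates:
            break
        filters.append(max(candidates, key=lambda v: prefix[v] * out_degree[v]))
    return filters
```

Because only Prefix and the out-degree are needed, each iteration is a single traversal of the edges, which matches the $O(k|E|)$ bound stated above.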

The three algorithms proposed above are all significantly
faster than Greedy_All. This superior running time is due
to the fact that these algorithms choose the filters using
less information about the graph. All three heuristics capture
different characteristics of a good filter set, and hence
their performance is not the same on different datasets.
First observe that a well-connected node in the network
has high in- and out-degrees and thus would be ranked high
by Greedy_1. On the other hand, Greedy_1 does not take into
account the location of the node in the network. Greedy_Max
computes the full impact of every node; thus it can give a
more accurate estimate of the influence of individual nodes
than Greedy_1. However, it fails to capture the correlation
between filters placed on the same path and thus might
choose filters that diminish each other's impact. Greedy_L
overcomes these shortcomings by combining these two methods.
However, this algorithm tends to pick nodes further
away from the source, since the Prefix() of nodes grows
exponentially with the distance from the source. The
differences in performance for various datasets are shown in our
experimental evaluation (Section 5).



Algorithm 2 The Greedy_L algorithm on DAGs.
Input: DAG G(V, E), integer k
Output: set of filters $A \subseteq V$ of size k.
1: $A = \emptyset$
2: for $i = 1 \ldots k$ do
3:     compute Prefix()
4:     $A \leftarrow A \cup \{\arg\max_{v \in V} \mathrm{Prefix}(v)\}$
5: return A

4.3 Filter Placement in General Graphs

Solving FP on general graphs is NP-hard. In this section
we propose a heuristic to choose an acyclic subgraph from
any graph. This allows us to apply the algorithms designed
for DAGs on this subgraph.

Let c-graph G'(V, E') be a general directed graph. Fix
an arbitrary ordering $\sigma$ of the nodes in V. We call an edge
$(v, u) \in E'$ a forward edge if $\sigma(v) < \sigma(u)$; otherwise it is
a backward edge. A well-known 2-approximation for choosing
an acyclic subgraph is the following greedy algorithm:
let F be the set of forward edges with respect to $\sigma$, and
let B be the set of backward edges. If |F| > |B| then choose
the DAG G(V, F), else choose the edges in B: G(V, B). The
drawback of this algorithm is that it does not guarantee the
resulting DAG to be connected. For this reason we develop our
own algorithm, Acyclic (Algorithm 3), to choose a connected
acyclic subgraph.
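The forward/backward split described above fits in a few lines. The sketch below uses a hypothetical function name, `larger_acyclic_half`, and takes the position of each node in the input list as the arbitrary order. The factor of 2 follows because one of the two edge sets contains at least half of all edges, while no acyclic subgraph can contain more than all of them.

```python
def larger_acyclic_half(nodes, edges):
    """Keep the larger of the forward and backward edge sets of a fixed order."""
    rank = {v: i for i, v in enumerate(nodes)}  # arbitrary but fixed order
    forward = [(u, v) for u, v in edges if rank[u] < rank[v]]
    backward = [(u, v) for u, v in edges if rank[u] > rank[v]]
    # Either set alone is acyclic: all of its edges point the same way
    # along the order. Self-loops fall into neither set.
    return forward if len(forward) > len(backward) else backward
```

As the text notes, nothing in this procedure keeps the result connected: a node whose only incoming edges are in the discarded set becomes unreachable from the source.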

The Acyclic algorithm finds an acyclic subgraph in two
steps. We can assume that there is only one source s in
G'; otherwise we create a new super-source s and direct an
edge from s to every source. First, Acyclic performs a DFS
traversal of G' starting in s. Every edge that is used during
this traversal is added to G. Second, every remaining edge
in E' is considered for addition. An edge $e \in E'$ is added to
E if it does not create a cycle. Observe that the resulting
acyclic subgraph is maximal, since no edge can be added
without creating a cycle.

Acyclic is built on the observation made in the previous
section: an item i that is generated by s reaches a node v
if there is at least one directed path from s to v. For this
reason, it is clear that every node that receives a copy of i
is visited by the DFS traversal in the first part of Acyclic.
Nodes that are not visited do not receive copies of i and are
thus uninteresting with regard to information propagation in G'.
The edges used during the DFS traversal form a spanning
tree T of G, thus making G connected.

In the second part of Acyclic, edges are added to G in a
greedy fashion: every edge $e \in E'$ is considered and is added
to E if it does not result in a directed cycle. A naive approach
for doing this would be to add the edge e in question
to G, and then run a DFS to determine whether $G \cup \{e\}$
is still acyclic. If not, then remove e. However, this would
require too many computations.
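For reference, the naive two-phase variant just described can be sketched as follows. It is not the signature-based method the paper goes on to develop: each candidate edge (u, v) is tested with a fresh reachability search from v to u, which is exactly the cost the signatures are meant to avoid. The function name `acyclic_subgraph` and the edge-list input format are hypothetical.

```python
from collections import defaultdict


def acyclic_subgraph(edges, source):
    """Return a sorted edge list of a maximal DAG on nodes reachable from source."""
    children = defaultdict(list)
    for u, v in edges:
        children[u].append(v)

    # Phase 1: iterative DFS from the source; tree edges go into the result.
    kept = defaultdict(set)
    visited = {source}
    stack = [(source, iter(children[source]))]
    while stack:
        u, remaining = stack[-1]
        for v in remaining:
            if v not in visited:
                visited.add(v)
                kept[u].add(v)
                stack.append((v, iter(children[v])))
                break
        else:
            stack.pop()

    def reaches(start, target):
        """True if target can be reached from start along kept edges."""
        seen, todo = {start}, [start]
        while todo:
            x = todo.pop()
            if x == target:
                return True
            for y in kept.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    todo.append(y)
        return False

    # Phase 2: add every remaining edge that does not close a directed cycle.
    for u, v in edges:
        if u in visited and v in visited and v not in kept.get(u, ()):
            if not reaches(v, u):
                kept[u].add(v)
    return sorted((u, v) for u in kept for v in kept[u])
```

Edges touching nodes that the DFS never visits are skipped, in line with the observation that such nodes do not receive any copy of the item.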

Our approach instead uses a decision mechanism based on
the location of nodes in T. We call the order in which nodes
are first visited during the DFS traversal the nodes' discovery
time, and denote it by $\sigma()$. We call a node a junction if it
has more than one child in T. Due to the DFS traversal,
there can be no forward edge with regard to $\sigma()$ in E' that
is not an edge in T. A backward edge (u, v) can be added to
E if there is no directed path from v to u. This is the case if
v and u are in different branches of the tree. Thus, there is a
junction w for which the paths $(w, w_{u1}), (w_{u1}, w_{u2}), \ldots, (w_{ur}, u)$
and $(w, w_{v1}), (w_{v1}, w_{v2}), \ldots, (w_{v\ell}, v)$ are in T and $w_{u1}$ and $w_{v1}$
are different. In order to decide the existence of such a w, we
need to keep for every node u a signature: sign(u) contains
a list of pairs $\{(w, w_{u1})\}$, where the first elements w of the
pairs are the junctions on the path $(s \to u)$. Observe that
$\sigma(w_{u1})$ is always less than or equal to $\sigma(u)$. Also, for any branch
starting in w, either all nodes' discovery times in that branch
are smaller than $\sigma(w_{u1})$, or all are larger than or equal to it.
When an edge (u, v) is considered for addition, we scan sign(u)
and sign(v). We find the w with the largest $\sigma(w)$ such that
$(w, w_{u1}) \in \mathrm{sign}(u)$ and $(w, w_{v1}) \in \mathrm{sign}(v)$. The nodes u and v
are in different branches (and thus edge (u, v) can be added) only if
$\sigma(v) < \sigma(w_{u1}) \le \sigma(u)$.

The DFS traversal in the first phase of Acyclic takes
$O(n \cdot \log n)$ time. To create the signatures we need to
traverse T once. Every node w passes on its signature list to
its children. Every child $u_w$ of w adds w to the list if w
is a junction; otherwise it uses sign(w) unchanged. This
introduces an additional O(n) steps to the algorithm. For
an edge (u, v) the comparison of sign(u) and sign(v) takes
$O(\log n)$ time. This needs to be repeated for every edge
in E', yielding a total running time of $O(n^2 \cdot \log n)$ for the
Acyclic algorithm.

Algorithm 3 Acyclic algorithm to find a maximal acyclic
subgraph.
Input: directed graph G'(V, E') with source s
Output: acyclic directed graph G(V, E)
1: DFS traversal starting in s
2: $E \leftarrow T$
3: compute signatures
4: for $(u, v) \in E'$ do
5:     if $\sigma(v) < \sigma(w_{u1}) \le \sigma(u)$ then
6:         $E \leftarrow E \cup \{(u, v)\}$



5. EXPERIMENTAL EVALUATION 

We present here a comparative experimental evaluation 
of the various Filter Placement algorithms, using syn- 
thetic data and real datasets. These datasets capture the 
propagation of information in real-life scenarios. We report 
the performance of the various algorithms with regard to 
the objective function as well as the running time. 




Figure 4: CDF of indegrees for synthetic graphs.

Performance Metric: To measure and compare the effectiveness
of various algorithms, we define the Filter Ratio
(FR) as the ratio between the objective function using filters
deployed in a subset of nodes (A) and the maximum
value of the objective function, i.e., the level of redundancy
reduction delivered by installing filters at the nodes in A. An
FR of 1 indicates complete elimination of redundancy.

$$\mathrm{FR}(A) = \frac{F(A)}{F(V)}$$
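As a small worked illustration of the ratio, the sketch below evaluates FR from three totals of received items. It assumes that the objective is $F(A) = \Phi(\emptyset, V) - \Phi(A, V)$, which is consistent with the appendix's remark that maximizing F is the same as minimizing $\Phi$; the exact definition of F appears earlier in the paper and is not restated here. The function name `filter_ratio` and the sample numbers in the usage note are hypothetical.

```python
def filter_ratio(phi_none, phi_a, phi_all):
    """FR(A) = F(A) / F(V), with F(X) taken as phi_none - phi_X (an assumption).

    phi_none: items received with no filters,
    phi_a:    items received with filters at the nodes in A,
    phi_all:  items received when every node is a filter.
    """
    max_gain = phi_none - phi_all
    if max_gain == 0:
        return 1.0  # no redundancy to remove; treated here as fully filtered
    return (phi_none - phi_a) / max_gain
```

For instance, with 12 items received without filters, 9 with the filter set A, and 8 when every node filters, `filter_ratio(12, 9, 8)` evaluates to 0.75.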



Baseline Algorithms: We compare our algorithms to a
set of random heuristics that serve as baselines.

Random_k (Rand_K): chooses k filters from V uniformly at
random.

Random_Independent (Rand_I): Every node becomes a filter
independently of other nodes, with probability $k/n$.

Random_Weighted (Rand_W): Every node v is assigned a weight
$w(v) = \sum_{u \in C_v} \frac{1}{d_{in}(u)}$, where $C_v = \{u \in V \mid (v, u) \in E\}$ is the
set of children of v. Then, every node becomes a filter with
probability $w(v) \times k/n$. The intuition behind this is that the
influence of node v on the number of items that its child u
receives is inversely proportional to the indegree of u.

Figure 5: FR for synthetic graphs.

Note that while we cannot guarantee that the number of 
filters for randomized algorithms is k, they are designed so 
that the expected number of filters is k. We run the ran- 
domized algorithms 25 times and then average the results. 
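The Random_Weighted baseline can be sketched as below. The function name `random_weighted_filters` and the `children` dictionary layout are hypothetical. The sampling probability is taken to be $w(v) \cdot k/n$ with n the number of nodes, which is an assumption: it is the reading under which the expected number of filters is close to k, since the weights sum to the number of nodes that have at least one parent. Clamping the probability at 1 is an addition of this sketch.

```python
import random


def random_weighted_filters(children, k, rng=random):
    """Sample filters with probability proportional to w(v).

    children maps every node to the list of its children.
    Returns the weight table and the sampled filter set.
    """
    in_degree = {}
    for kids in children.values():
        for u in kids:
            in_degree[u] = in_degree.get(u, 0) + 1
    n = len(children)
    # w(v) sums 1 / indegree(u) over the children u of v.
    weight = {v: sum(1.0 / in_degree[u] for u in kids)
              for v, kids in children.items()}
    chosen = {v for v in children
              if rng.random() < min(1.0, weight[v] * k / n)}
    return weight, chosen
```

Passing a seeded `random.Random` instance as `rng` makes repeated runs reproducible, which is convenient when averaging over several trials as described above.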

Results using synthetic datasets: To test some basic
properties of our algorithms we generate synthetic graphs.
First, we assign nodes to 10 levels randomly, so that the
expected number of nodes per level is 100. Next, we generate
directed edges from every node v in level i to every node u in
level j > i with probability $p(v, u) = \frac{x}{y^{j-i}}$. The choice of x
and y influences the density of the graph. The exponent of y
is designed in such a way that nodes in nearby levels have a
higher probability of being connected than nodes that are
far apart. We choose to experiment with the combinations
(x, y) = (1, 4) and (x, y) = (3, 4). For $x/y = 1/4$ we generate
a graph with 1026 nodes and 32427 edges. For $x/y = 3/4$ we
generate 1069 nodes and 101226 edges. The CDFs of the
indegree are shown in Figure 4. The CDFs of the outdegree
are quite similar, and thus omitted due to space limitations.
Observe that nodes on the same level have similar properties;
the expected number and length of paths going through
them is the same.
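A generator in the spirit of this construction might look as follows. Several details are assumptions: the edge probability is taken to be $x / y^{j-i}$ (the exponent is garbled in the source text), the number of nodes is fixed up front rather than random, and each node is assigned a level uniformly at random. The function name `layered_dag` and its parameters are hypothetical.

```python
import random


def layered_dag(num_nodes, num_levels, x, y, seed=0):
    """Generate a random layered DAG; returns (level of each node, edge list)."""
    rng = random.Random(seed)
    level = {v: rng.randrange(num_levels) for v in range(num_nodes)}
    edges = []
    for v in range(num_nodes):
        for u in range(num_nodes):
            gap = level[u] - level[v]
            # Only edges from a lower to a strictly higher level are allowed,
            # with probability x / y**gap, so nearby levels connect more often.
            if gap > 0 and rng.random() < x / y ** gap:
                edges.append((v, u))
    return level, edges
```

Calling `layered_dag(1000, 10, 1, 4)` gives about 100 nodes per level in expectation; because every edge points to a strictly higher level, the result is acyclic by construction.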

Figures 5(a) and 5(b) reveal a gradual increase in FR as 
a function of the number of filters. This shows that the cho- 
sen filters are nodes that cover roughly equal-sized, distinct 
portions of all the paths in the graphs. 

Results using the Quote dataset: The Quote dataset by
Leskovec et al. [17] contains the link network of mainstream
media sites ranging from large news distributors to personal
blogs, along with timestamped data indicating the adoption
of topics or different phrases by the sites. We utilize
monthly traces from August 2008 to April 2009 to generate
a c-graph G_Phrase. Since the Quote graph, which has over
400K edges, is very large, in our experiments we select a
subgraph. The subgraph we chose contains the nodes and
adjacent edges corresponding to sites that use the phrase
"lipstick on a pig". Sites may freely link to each other, which
might result in cycles. We run Acyclic to find a maximal
acyclic subgraph in this graph. There is no clear initiator of
the phrase in the blogosphere, since it was used by a candidate
during the 2008 presidential campaign. For this reason,
we run Acyclic initiated from every node in the graph, and
then choose the largest resulting DAG. This DAG has a single
source: the node Acyclic was started from. It contains
932 nodes and 2,703 edges.

Figure 6: CDF of node indegree for G_Phrase.

Figure 6 shows the CDF of the nodes' in-degree in G_Phrase.
We found that almost 70% of the nodes are sinks and almost
50% of the nodes have in-degree one. There are a number
of nodes which have both high in- and out-degrees. These
are potentially good candidates to become filters. The steep
curve of FR for G_Phrase in Figure 7 confirms our intuition:
as few as four nodes achieve perfect redundancy elimination
for this dataset. As expected, Greedy_All performs the best
with regard to the FR. Greedy_Max picks a different node
for k = 2, but for k > 3 it picks the same filter set and
hence performs as well as Greedy_All. The two heuristics,
Greedy_1 and Greedy_L, are only slightly less effective.
We tracked the connection between the node chosen first by
Greedy_All (node A) and the node chosen first by Greedy_L
(node B). These nodes were connected by a path of length
2. The impact of A is larger than the impact of B, since A
also influences all of B's children. However, B is chosen over
A by Greedy_L, since the prefix of B is much larger than
that of A. The four central nodes chosen by Greedy_All
also explain why Random_Weighted performs so well: nodes
with large weights (and thus high probability of becoming
filters) are those nodes with a large number of children. The
randomized algorithms Random_k and Random_Independent
perform significantly worse than all others because of the
high fraction of sink nodes in the graphs.

Figure 7: FR for G_Phrase on the Quote dataset; x-axis
corresponds to the number of filters, y-axis corresponds
to the FR for different algorithms.

Results using the Twitter dataset: The Twitter dataset
was collected by Kwak et al. [16].² The dataset contains user
ids and links between users, directed from the user to their
followers. The complete dataset contains over 41 million user
profiles. In order to select a subgraph of feasible size for our
experiments, we first ran a breadth-first search up to six
levels, starting from the user "sigcomm09". Our goal was to
find a subnetwork of users related to the computer science
community. For this, we created a list of keywords related to
computer science, technology and academia and filtered the
user profiles of the followers according to that. The resulting
network is an acyclic graph with a single root "sigcomm09",
which we consider the source of information in this
subnetwork. The graph contains about 90K nodes and 120K
edges. The number of outgoing edges from the different
levels of the graph shows an exponential growth: 2, 16, 194,
43993 and 80639 for levels 1, 2, ..., 5. We had to remove
a small number of edges in order to maintain an acyclic
graph. Observe that this graph is quite sparse compared to
the other datasets. Figure 8 shows that Greedy_All can remove
all redundancy by placing as few as six filters. Our
other heuristics also perform well. Greedy_Max, Greedy_1
and Greedy_L all achieve complete filtering with at most ten
filters. The convergence of FR to one is slower for Greedy_L
than for the other algorithms, because of its tendency to choose
nodes further away from the source.

Results using APS research dataset: The APS research
dataset³ contains the citation network of over 450,000 articles
from Physical Review Letters, Physical Review, and
Reviews of Modern Physics, dating back to 1893. The dataset
consists of pairs of APS articles, one citing the other. This
can be viewed as a graph with a directed edge from node
A to B if B cites A. We select article [29], published in
Physical Review, as the source node, and take the subgraph
of nodes that can be reached from this node through directed
paths. In this case only the node corresponding to
paper [29] is connected to the source. The resulting subgraph
is intended to portray the propagation of an original
concept or finding in this paper through the physics community:
a filter in this setting can be seen as an opportune
point in the knowledge transfer process to purge potentially
redundant citations of the primary source (e.g., derivative
work).⁴ The resulting citation graph G_Citation is acyclic
and contains 9,982 nodes and 36,070 edges. As for G_Phrase
and the Twitter graph, G_Citation has a power-law distribution
of in- and out-degrees. (Plot omitted due to space
constraints.)

² Available at http://an.kaist.ac.kr/traces/WWW2010.html

³ Available at https://publish.aps.org/datasets

Figure 8: FR for the Twitter graph; x-axis corresponds
to the number of filters, y-axis corresponds
to the FR for different algorithms.

Figure 9: FR for G_Citation in the APS dataset; x-axis
corresponds to the number of filters, y-axis corresponds
to the FR for different algorithms.



While Greedy_1, Greedy_L and Greedy_Max all converge to a
high level of redundancy removal with fewer than fifteen
filters, it is evident from Figure 9 that Greedy_All performs
better here than the alternatives. The G_Citation graph
is a good illustration of the potential shortcomings of our
heuristics. As sketched in Figure 10, the graph has a set of
nine nodes, interconnected by a path, that all have indegree
one. All paths from the upper to the lower half of the graph
traverse these nodes, which makes them all high-impact.
However, placing a filter in the first node greatly
diminishes the impact of the remaining nodes. This remains
unobserved by Greedy_Max, resulting in the long range over
which G_Max is constant.

⁴ In a live corpus of interconnected documents and citations,
filters can be seen as the key documents in which to
consolidate references to a specific source.

Figure 10: Sketch of APS graph.

Summary of comparative performance: Comparing 
the results for the synthetic data sets (Figures 5(a) and 
5(b)) to the results for the real datasets (Figures 7, 8 and 9) 
reveals a significant difference in the marginal utility from 
added filters (the steepness of the performance curve). For 
the synthetic data sets, there is a slow gradual improvement 
in performance for all algorithms as the number of filters 
increases, suggesting a fairly constant marginal utility of 
additional filters throughout. In contrast, for the Quote and 
Twitter datasets, almost all redundancy can be filtered out 
with at most ten filters, implying no marginal utility from 
added filters beyond that point. For the APS data set, there 
is more of a spread in the performance of the algorithms, 
but the best algorithms have very steep performance curves 
as well. 

This disparity between the results on synthetic and real 
data sets can be explained by the structure of the underlying 
c-graph. The synthetic graphs are densely connected, and 
as a result paths cannot be covered with a small number of 
filters. On the other hand, the real data sets have a small 
set of "key" nodes that cover all paths in the graph. In 
conclusion, while our methods can be used effectively on 
all types of graphs, placing filters is more effective when 
operating over sparse graphs (which are more prevalent in 
many information networks). 

Running times: Although all our algorithms have running 
time polynomial in the number of nodes in the dataset, we 
also investigate their efficiency in practice. For that, we 
report here their actual running times in one of our datasets. 
Note that the conclusions we draw are identical for other 
datasets and thus we omit the results due to lack of space. 

We implemented our algorithms in Python, and although 
our implementation uses clever data structures and other 
necessary optimizations, our code is not focused on optimiz- 
ing performance. Therefore, our efficiency study should be 
seen as an evaluation of the relative efficiency of the differ- 
ent algorithms. For our experiments we used a machine with 
4GHz AMD Opteron with 256GB of RAM, running a 64-bit 
Linux CentOS distribution. 

Figure 11 reports the running times of the different algorithms
for the Twitter dataset in seconds for k = 10 filters.
As expected, Greedy_1 (with worst-case running time of
$O(|E|)$) is our fastest algorithm, with a running time of less than
a minute. Greedy_All is the most computationally intensive
method, with a running time of 83 minutes, since it requires
the recomputation of the impact of every node in every
iteration. Finally, Greedy_Max and Greedy_L appear to have
similar running times of approximately 60 minutes, despite
the fact that Greedy_Max computes the impact only once.
As we have seen, the tendency of Greedy_L
is to pick nodes at the end of the topological order (away
from the source). After selecting such a node v as a filter, the
modified impact $I'$ of most of the nodes remains the same;
the only nodes whose value of $I'$ changes are those that come
after v in the topological order. Since there is a small number
of such nodes, clever bookkeeping allows us to make these
updates in practically constant time.

Figure 11: Execution times for the placement of ten
filters in the case of the Twitter dataset.

Overall, our algorithms Greedy_1, Greedy_Max and Greedy_L
are much more efficient than Greedy_All and can be applied
to larger datasets. This observation, combined with the fact
that these algorithms produce high-quality results, makes them
appropriate for solving the FP problem in practice.

6. CONCLUSIONS 

Networks in which nodes indiscriminately disseminate in- 
formation to their neighbors are susceptible to information 
multiplicity, i.e., a prevalence of duplicate content communicated
over a multitude of paths. In this paper, we proposed
a particular strategy to mitigate the negative implications
of information multiplicity. Our approach relies on
the installation of filters at key positions in the information
network, and we accordingly defined the Filter Placement
problem. In addition to characterizing the computational
complexity of Filter Placement (polynomial for trees, 
but NP-hard for DAGs and arbitrary graphs), we devised 
a bounded-factor approximation as well as other scalable 
heuristic algorithms. We evaluated our methods experimen- 
tally using synthetic and real data sets. Our results sug- 
gest that in practice, our Filter Placement algorithms 
scale well on fairly large graphs, and that typically a small 
number of filters is sufficient for purposes of removal of re- 
dundant content in real-world (typically sparse) information 
networks. 

Our current and future work focuses on extensions to
the information propagation model adopted in this paper
to take into consideration multirate information sources. In
addition, we are investigating different formulations of the
Filter Placement problem in which the filter functionality
goes beyond the deterministic and probabilistic content
filtering considered in this paper.

APPENDIX

In the appendix we prove Theorems 1 and 2.

Theorem 1. The FP problem on an arbitrary c-graph
G(V, E) is NP-complete.

PROOF. First of all observe that for the communications
graph G the set $A \subseteq V$ maximizes $F(\cdot)$ whenever $\Phi(A, V)$
is minimized. Hence we will prove the hardness of Problem 1
by showing that finding a placement of k filters A
such that $\Phi(A, V)$ is minimized is NP-complete. We prove
this by showing that finding the smallest integer k for which
the number of received items in graph G(V, E) is finite is
equivalent to the SetCover problem. An instance of SetCover
consists of the universe $U = \{u_1, u_2, \ldots, u_m\}$, a
set $S = \{S_1, S_2, \ldots, S_n\}$, where every $S_i \subseteq U$ is a subset of U,
and an integer k. The goal is to find a subset $S' \subseteq S$ such
that $|S'| \le k$ and $\{u_j \in U : u_j \in \bigcup_{S_i \in S'} S_i\} = U$. Define the
instance of FP as follows. First, define graph G by creating
a node $v_i$ for every set $S_i \in S$. Fix an arbitrary cyclic order
$\sigma$ of the nodes $v_i$. A cyclic order is the same as a linear
order with the additional constraint that $\sigma(n + 1) = \sigma(1)$.
For every element $u \in U$ add an edge $v_{j_1} \to v_{j_2}$ whenever
$u \in S_{j_1}$, $u \in S_{j_2}$ and $\sigma(v_{j_1}) < \sigma(v_{j_2})$. Observe that this
adds a directed cycle to the graph for every $u \in U$. Also add
a source $v_s$ to the graph and add an edge from the source to
all other nodes in the graph. Imagine now that the source
creates one single item and propagates that to its children.
Observe that now an infinite number of items will propagate on
every directed cycle. Let k be the integer specified in the
instance of SetCover and let I be an arbitrary finite integer.
Now for the decision version of the FP problem, if the
answer to the question "Is there a filter assignment A with
$|A| \le k$, such that $\Phi(A, V) < I$?" is "YES", then this also
implies a solution with k sets for the SetCover problem.
Since the decision version of SetCover is NP-complete, this
reduction shows that FP is also NP-complete. □

Theorem 2. The FP problem is NP-complete when the
c-graph G(V, E) is a DAG.

Proof. We reduce the NP-complete VertexCover problem
to the FP problem on DAGs. We say that for an undirected
graph G(V, E), a set $A \subseteq V$ is a vertex cover of G if
every edge in E is incident to at least one node in A. For an
instance of the VertexCover problem, let G(V, E) be an
undirected graph and k an integer. The decision version of
the problem asks for a set $A \subseteq V$ of size k that is a vertex
cover.

Define the DAG G'(V', E') of the corresponding FP problem
as follows. Let $V' = V \cup \{s, t\}$ contain the nodes in G,
an additional source node s and an additional sink t. Let
E' contain all edges in E. In addition to that, add an edge
from the source to every node, and from every node to the
sink. Fix an arbitrary order $\sigma$ of the nodes in V', such that
s is the first and t is the last in this ordering. Then direct
every edge $(u, v) \in E'$ from u to v if $\sigma(u) < \sigma(v)$, otherwise
from v to u. This will naturally result in a DAG. Let m be
an arbitrary integer such that $m = \Omega(|V'|^{10})$. We will replace
every directed edge in E' (including the edges incident
to s and t) with the following multiplier tool (Figure 12).
For every edge (u, v) we add m new nodes $w_1, w_2, \ldots, w_m$
and 2m new directed edges $(u, w_i)$ and $(w_i, v)$. Observe
that by this exchange, the size of the graph only changes by
a polynomial factor of the original size. Now we will prove
that there exists a vertex cover A of size at most k for this
instance of the VertexCover problem if and only if there
exists an FP A' of size k where $\Phi(A', V') < O(m^3)$. In
addition we claim that $A' \subseteq V$ and thus A = A' is the desired
solution for the VertexCover.
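The effect of the multiplier tool can be checked on a small example. The sketch below replaces every edge by m two-hop paths and then counts source-to-node paths, which in the filter-free setting equals the number of item copies a node receives. The helper names `apply_multiplier` and `count_paths_from`, and the tuple used to label the intermediate nodes $w_i$, are hypothetical.

```python
from graphlib import TopologicalSorter


def apply_multiplier(edges, m):
    """Replace every edge (u, v) by m two-hop paths u -> w_i -> v."""
    new_edges = []
    for u, v in edges:
        for i in range(m):
            w = ("w", u, v, i)  # fresh intermediate node for this edge
            new_edges += [(u, w), (w, v)]
    return new_edges


def count_paths_from(edges, source):
    """Number of directed paths from source to every node of a DAG."""
    parents = {}
    for u, v in edges:
        parents.setdefault(u, set())
        parents.setdefault(v, set()).add(u)
    paths = {}
    for v in TopologicalSorter(parents).static_order():
        paths[v] = 1 if v == source else sum(paths[p] for p in parents[v])
    return paths
```

On the two-edge chain s -> u -> v with m = 3, the construction adds 6 new nodes and 12 edges, and the path count at v becomes 9: each multiplied edge scales the number of copies by m, while the graph grows only linearly in m.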

Let us assume A' is the solution of size k for the FP
problem and $\Phi(A', V') < \Omega(m^3)$. We will show that $A' \subseteq V$
and that A is a vertex cover of G. In particular we will show
that if there is an edge (u, v) in E that is not incident
to any node in A, then $\Phi(A', V') > O(m^3)$. As seen in
Proposition 1, it is more advantageous to put the filter in
the parent of a node with indegree 1 than in the node itself.
For this reason, we can assume that filters are only placed in
the nodes $w_i$ of the multiplier tool if all nodes with indegree
larger than 1 are already filters. In this case, all nodes in
$V \subseteq V'$ would be filters, and then A is a trivial vertex
cover. Thus we can assume $A' \subseteq V$. Now we will show
that A is a vertex cover. Let us consider the subgraph $G_{uv}$
depicted in Figure 13, corresponding to the nodes $u, v \in V$
and the adjacent edges. $\sigma_i$ depicts the number of incoming
items on that edge. Let $\Sigma_u = \sigma^u_1 + \sigma^u_2 + \cdots$ and
$\Sigma_v = \sigma^v_1 + \sigma^v_2 + \cdots$ be the total number of items u and
v receive from other nodes.

Figure 12: "Multiplier edge" construction for G'.
When x items leave u, $x \cdot m$ items arrive at v.

Figure 13: Isolated subgraph $G_{uv} = \{u, v, s, t\}$ of G'.

Let us assume that for every edge $(u, v) \in E$ at least one of
{u, v} is in A'. Then we can compute an upper bound on
the number of items propagating in this subgraph with
respect to A'. There are three possible cases, followed by the
case in which the assumption is violated:

case $u \in A'$, $v \in A'$: In this case the total number of
items propagating on the edges of $G_{uv}$ is
$\Sigma_u + m + m + 2m + \Sigma_v + m + m + m + m$, which is $O(m^2)$.

case $u \in A'$, $v \notin A'$: Then the total number of items
propagating on the edges of $G_{uv}$ is
$(\Sigma_u + m) + 2m + (\Sigma_v + m) + ((m + (\Sigma_v + m)) \cdot m)$, which is $O(m^2)$.

case $u \notin A'$, $v \in A'$: Here the total number of items
propagating on the edges of $G_{uv}$ is
$(\Sigma_u + m) + 2 \cdot ((\Sigma_u + m) \cdot m) + (\Sigma_v + m) + m$, which is $O(m^2)$.

case $u \notin A'$, $v \notin A'$: The total number of items
propagating on the edges of $G_{uv}$ is
$(\Sigma_u + m) + 2 \cdot ((\Sigma_u + m) \cdot m) + (\Sigma_v + m) + ((((\Sigma_u + m) \cdot m) + (\Sigma_v + m)) \cdot m)$,
which is $O(m^4)$ for worst-case $\Sigma_u, \Sigma_v$ and $O(m^3)$ in the best case.

The total number of subgraphs $G_{uv}$ in G' is |E|. Thus if
A is a vertex cover in G, then for A' = A the number of items
is bounded by $\Phi(A', V') = O(n^2 \times m^2) < m^3$. On the other
hand, let us assume that A = A' is not a vertex cover in G.
This means that there is at least one edge $(u, v) \in E$ for
which $u \notin A'$ and $v \notin A'$. In this case $\Phi(A', V') = \Omega(m^3)$,
in contradiction with our assumption.

In this proof we showed that there exists a vertex cover
of size k for G(V, E) if and only if there exists an FP A' of
size k for G'(V', E'), with a bounded number of total items
$\Phi(A', V') = O(m^2)$. This provides a reduction of the
VertexCover problem to FP, thus making it NP-complete. □